Abstract

This paper describes work to enhance a sentence- 
based summarizer with notions of salience, dynamically- 
adjustable summary size, discourse segmentation, and 
awareness of topic shifts. Our experiments study strate- 
gies to diversify the application of a baseline summarizer, 
by making it aware of finer-grained 'aboutness', capable of
discerning changes of topic, and sensitive to longer-than- 
usual documents. Evaluated against the corpus used in 
the development of the baseline summarizer, summaries de- 
rived either by means of segmentation analysis alone, or by 
a mix of strategies for combining salience calculation and 
topic shift detection, are shown to be of comparable, and 
under certain conditions even better, quality. We describe 
the summarization and segmentation procedures, outline a 
number of strategies for mixing the two, evaluate the overall 
impact of discourse segmentation, and suggest an interface 
design capable of using the notion of topic shifts to contex- 
tualize a summary and facilitate the mediation between it 
and the full document source. 



1. Introduction 

Document summarization has become de facto a critical 
component in any toolkit for on-line information manage- 
ment, as witnessed at least by dedicated conferences and 
symposia [1], coordinated evaluation initiatives [12], and 
real-world deployment [7]. Still, in the absence of a coher- 
ent theory of summarization, and even less so of a formal 
computational model of summary derivation, virtually all 
general purpose summarizers (whether in wide deployment, 
or of a more experimental nature) currently use variations 
on the same theme: they compose a summary by 'stitching 
together' representative fragments — typically sentences — 
from the original full length document text. 

This strategy is sub-optimal, as users have to contend
with loss of coherence, deterioration of readability, and
thematic under-representation [4]. To a large extent all of these
problems stem from arbitrarily long passages of the original
document being omitted between any two adjacent sentences
in the summary; thus loss of essential information¹
interferes with the intended use of the summary.

¹ For instance: a "dangling" anaphor, without an antecedent; the
reversal of a core premise in an argument; the introduction, and/or
elaboration, of a new topic; these are just a few examples of
missing essentials.

Even if users are prepared to compromise, in order to get
some idea of what a document is about without having to
read all of it, such factors lead to rapid degradation of the
usefulness of a sentence-based summary in situations beyond
the most typical "what is this news story about". Examples
of such situations might include: occasions when
traditional methods are applied to documents larger than
a couple-of-pages-long news article; when the user needs
more complete awareness of all major themes in a document;
or when different summaries might be appropriate to
different user information-seeking contexts.

We have chosen to address such problems by enhancing
a sentence-based summarizer with notions of salience (as
determined with respect to a background document collection)
and dynamically-adjustable size of the resulting summaries
(see [25], and below). However, by focusing on
salience as a solution to one set of problems, we become
dependent on statistics of a background collection, which
clearly limits the applicability of the summarizer across a
range of document types and genres.² Furthermore, it is far
from clear that salience alone offers a complete solution to
the problems of incoherence and thematic under-representation:
for instance, it is not clear how to use it in environments
where it is essential to track all the topics/sub-stories
in the original document, or to remain sensitive to changing
user profiles and interests.

² It is possible to supply a 'generic' background collection, against
which summaries could be generated even for documents which are not a
priori part of the collection. This is problematic, at least because it is a
highly genre-dependent approach. In addition, the generation of a
background collection and statistics for it might be impractical for a variety
of reasons: lack of access to a sufficiently large and representative data
sample; no time for processing; sparse storage resources; and so forth.

This paper describes some early work on leveraging elements
of the larger discourse structure in an attempt to enhance
the operation of a salience-based sentence extraction
summarizer. In the longer term, this is just one aspect of a
larger study on the recognition and use of cohesive devices
for a variety of content characterisation tasks. As such, it
presupposes fine-grained methods for the identification of
cohesive ties between (sentence) units in a text; such ties
are typically manifested in textual substitution, lexical
repetition, co-reference and ellipsis, paraphrasing, conjunction,
and so forth. Even if a framework for such analysis takes
a while to implement, in the immediate term a 'working
approximation' is provided by the phenomenon of simple
lexical repetition. We use this to develop an operational
definition of discourse segmentation, where segments in a
document are defined to be contiguous blocks of text (typically
spanning several paragraphs), roughly 'about the same
thing', with segment boundaries indicative of topic shifts,
and/or changes in themes of discussion.

1.1. Segmentation-assisted summarization 

In our work on enhancing summarization by folding in 
results of linear discourse segmentation, we appeal to a 
number of common intuitions. In general, we focus on 
strategies to diversify a summarizer, by making it aware of 
finer-grained 'aboutness', capable of discerning topic shifts, 
and sensitive to longer-than-usual documents. In a sentence 
extraction-based model of summarization, making certain 
that a summary incorporates sentences from each segment
seeks to ensure uniform representation of all sub-stories in 
a document; the notion here is to avoid having inordinately 
large gaps between two adjacent summary sentences, which 
would tend to lose essential information. Moreover, assum- 
ing a mechanism which would pick the sentence(s) within 
a segment which are representative of the main topic dis- 
cussed in the segment, such a selection strategy would carry 
over into the summary 'traces' of all the main topics in the 
original document. 

This is more than just an intuition. In the process of 
developing, and training, the base summarization function 
described below (Section 2.2), an analysis was carried out 
to determine the causes of a certain class of failure.³ It
turns out that 30.7% of the failures could be prevented by 
a heuristic sensitive to the logical structure of documents, 
which would enforce that each section gets represented in 
the summary. An additional 15.2% of failures could also be
avoided if the summarizer was capable of detecting sub- 
stories within a single section, leading/trailing noise (see 
below), and so forth. Thus almost half of the errors (in this 
particular task, at least) could have been avoided by using a 
segmentation component. 

The specific strategies for being sensitive to foci of atten- 
tion within a segment, and topic shifts between segments, 
may vary, depending on other environment settings for the 
summarizer; we return to this question below (Section 3). 
As we shall see, even very simple approaches — say, take 
the first sentence from each segment — have remarkably no- 
ticeable impact in certain situations. 

While segmentation offers plausible schemes for deriving
sentence-based summaries with certain discourse properties,
it turns out that at least one such scheme also allows
the summarizer to operate, in certain cases very effectively,
without a need for background corpus statistics.

³ In a task-based evaluation protocol (see Section 3.1 below), quality of
summaries was assessed by using them to determine whether a document is
relevant to a query or not. The evaluation environment provided a training
corpus, against which the summarizer was developed, and which was used
as the basis for our analysis.

Another use for a segmentation component in summa- 
rization context is for optimising the use of source input, 
as well as possibly maximising its re-use. Occasionally, 
the document contains 'noise' — this may be in the form 
of anecdotal leads, closing remarks tangential to the main 
points of the story, side-bars, and so forth — which should 
not be considered as source for summary sentences. Linear 
segmentation sensitive to topic shifts and document struc- 
ture would identify such source fragments and remove them 
from consideration by the summarizer. Conversely, in cer- 
tain genres of news reporting a whole document fragment 
(typically towards the beginning or the end of the document) 
functions as a summary of the story: we would like to be 
able to use this fragment; clearly identifying it as a segment 
is part of the whole task. 

We also use segmentation to handle long documents 
more effectively. While the collection-based salience deter- 
mination works reasonably well for the average-length news 
story, it has some disadvantages. For longer documents, 
with requisite longer summaries, the notion of salience de- 
generates, and the summary takes on more of the appear- 
ance of an incoherent collection of sentences. In certain 
contexts, paragraph-, rather than sentence extraction, has 
been proposed as a working solution; see e.g. [37]. Apart 
from being inherently suited to longer texts, due to its larger
granularity, this suffers from the same problems of patch- 
iness and/or under-representation brought up earlier in this 
section [23]. We use segmentation to identify contiguous 
sub-stories in long documents, which are then individually 
passed on to the summarizer; the results of sub-story sum- 
maries are 'glued' together. 
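
The long-document strategy just described can be sketched as follows. This is a minimal illustration in Python, not the Textract implementation: the names `summarize_long_document` and `first_sentence` are hypothetical, segments are assumed to be already available as lists of sentences, and the per-segment summarizer is passed in as a plain function.

```python
from typing import Callable, List

Sentence = str
Segment = List[Sentence]


def summarize_long_document(
    segments: List[Segment],
    summarize: Callable[[Segment], List[Sentence]],
) -> List[Sentence]:
    """Summarize each sub-story separately and glue the results in document order."""
    summary: List[Sentence] = []
    for segment in segments:
        summary.extend(summarize(segment))
    return summary


def first_sentence(segment: Segment) -> List[Sentence]:
    """Toy stand-in summarizer (an assumption): keep the first sentence of a segment."""
    return segment[:1]
```

Passing the summarizer as a parameter keeps the sketch independent of how individual sub-stories are scored; any sentence extraction function with the same shape could be substituted.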

The remainder of this paper is organized as follows. 
Section 2 presents an overview of the document processing
infrastructure within which the summarization function
is just one component, and gives some details about the 
processes of summary generation and linear discourse seg- 
mentation. We focus in particular on how the higher level 
content analysis functions make use of lower level shallow 
linguistic processing, in order to obtain a richer model of 
the document(s) domain, and to leverage a cohesion metric 
for sub-story identification. Section 3 presents the results 
from a number of experiments, comparing the performance 
of summarization alone to segmentation-enhanced summa- 
rization; to set the context, we outline the evaluation testbed 
environment we use. Following a discussion of the results, 
which suggest specific run-time strategies for optimally us- 
ing the notions of discourse segments and topic shifts for 
summarization, we outline some core features of an inter- 
face which tries to make 'visual sense' of the notions we use 
(salience, topics, summary sentences, discourse segments, 
context, and so forth). We conclude with an assessment 
of the overall utility of 'cheap' approximations to lexical 
coherence measures, specifically from the point of view of 
enhancing a fully operational summarizer engine.



2. Background technologies 

Unlike most operational summarization systems to date, 
the one discussed here is an integral component of a much 
larger infrastructure for document processing and analysis, 
comprising a number of interconnected, and mutually en- 
abling, linguistic filters. The whole infrastructure (hereafter 
referred to as Textract) is designed from the ground up to 
perform a variety of linguistic feature extraction functions, 
ranging from straightforward, single pass, tokenisation, lex- 
ical look-up and morphological analysis, to complex aggre- 
gation of representative (salient) phrasal units across large 
multi-document collections. To a large extent these char- 
acteristics of our document processing environment define 
the basic design decisions concerning the specifics of our 
summarizer: sentence selection based upon salience rank- 
ing of phrasal units in individual documents, against a back- 
ground of the distribution of phrasal vocabulary across a 
large multi-document collection. 

2.1. Textract infrastructure 

For the purposes of this paper, Textract can be viewed 
as a robust text analysis system that identifies proper names 
and technical terms, along with their variants (contractions, 
abbreviations, colloquial uses, and so forth) in individual 
documents in a multi-document collection, and builds a col- 
lection vocabulary of canonical forms and variants with sta- 
tistical information concerning their distribution behaviour 
and prominence patterns across the collection. The collec- 
tion vocabulary and statistics are used in the summarizer's 
salience calculation, which, in turn, is a significant compo- 
nent of the sentence-level score that selects the sentences 
for extraction. 

Most of the linguistic analysis of Textract utilized by 
the summarizer is derived through a variety of shallow tech- 
niques. This is partly motivated by the requirements of 
an operational and robust system capable of efficient pro- 
cessing of thousands of documents/gigabytes of data. Al- 
ternatively, this can be viewed as an ongoing investigation 
into how much of higher level semantic and discourse func- 
tions can be realized from a shallow linguistic base [18]. In 
any case, we disagree with claims that morphological anal- 
ysis and multi-word identification would complicate pro- 
cessing, without benefit to function (see, for instance, [5]). 
The Textract system, known commercially as Intelligent 
Miner for Text, is an IBM product which has been success- 
fully deployed in a number of operational information man- 
agement environments (see, for instance, [6], [27]); its sum- 
marizer component is comparable in performance to other 
industry-strength state-of-the-art technologies [21]. 

As a fundamentally frequency-based system, the sum- 
marizer is ideally positioned to exploit Textract's func- 
tions for linguistic analysis, filtering, and normalization. 
Thus, morphological processing allows us to link multi- 
ple variants of the same word, by normalizing to lemma 
forms. A proper name identifier, Nominator [34], not
only marks "Bill" as a name=>person, but also distinguishes
between it and "HIV", thus reducing noise in the
frequency counting [39]. Further, its ability to identify "Bill
Clinton" and "Clinton" as variants of the same name boosts
the frequency of the concept (and ultimately its salience) in
the document. Similarly, a light-weight component for
resolving definite noun phrase anaphora identifies "the law
firm" and later "the firm" as co-referring, allowing both
to be counted together. The interaction of Nominator
with Abbreviator makes it possible to recognize "American
Bar Association" and its variant "ABA" as also
co-referring. Yet a different component, Terminator,
implements a version of technical terminology identification
and extraction [16]; this enables the recognition of certain
multi-word concepts mentioned in the document, with
discourse properties which reflect high topicality value, which
is also directly relevant to salience determination. The
interaction of Nominator with Terminator makes it possible
to analyze "Treasury bill" and "Alzheimer's disease"
as multi-word phrasal units.

In the analysis of a multi-document collection, each doc- 
ument is analyzed individually. All 'content' words (non- 
stop words, in Information Retrieval terminology), as well 
as all the phrasal units identified by the Textract linguis- 
tic filters, are deemed to be vocabulary items, indexed via 
their canonical forms. With a view to future extensions of 
the base summarization function (see Section 5), these re- 
tain complete contextual information about the variants in 
which they have been encountered, as well as the local con- 
text of each occurrence. The vocabulary items are counted 
and aggregated across documents to form the collection vo- 
cabulary. Aggregating together similar items from differ- 
ent documents (cross-document co-reference) is far from 
straightforward for multi-word items; however, being able 
to carry out a process of cross-document coreference reso- 
lution is clearly a further enabling capability for obtaining 
more precise collection statistics [33].

In addition to the domain vocabulary, the summarizer 
also has access to the document structure provided by the 
Textract base. The document structure builder produces 
a structural representation of the document, which carries 
explicit identification of content and layout metadata. These 
include: appearance and layout tags; document title; ab- 
stract, and other front matter; section, subsection, etc. head- 
ings; paragraphs, themselves composed of sentences; ta- 
bles, figures, captions, and other 'floating' objects; side- 
bars and other kinds of text extraneous to the main docu- 
ment narrative; and so forth. At present, document structure 
is constructed by 'shadowing' markup parsing, as markup
tags are used to construct the document structure tree. For 
documents which lack markup tags, a separate component, 
Layser (LAYout parSER), facilitates the document struc- 
ture builder by carrying out structure determination on the 
basis of two-dimensional (page) layout cues. Additional 
discourse-level annotations may also be recorded in the doc- 
ument structure, such as cue phrases marking rhetorical re- 
lations, quoted speech, and so forth. 



2.2. Summarization component 

The Textract summarizer was explicitly designed to 
leverage Textract's linguistic filters for the analysis of 
documents. It is a frequency-based system; however, due 
to the depth of analysis by the filters (see Section 2.1), it 
is able to exploit a richer source of domain knowledge than 
most other frequency-based systems. We are not alone in 
exploiting linguistic dimensions beyond single word anal- 
ysis (see [2], for instance, for a sentence-based summa- 
rizer using multi-word sequences). The motivation for such 
an approach (intuitively, lack of discourse processing
adversely affects the quality of an abstract) has been
formulated a while ago [29], and reiterated since [15], [28]; but it
is only recently that robust shallow and scalable techniques 
have been developed for unconstrained texts. 

Early frequency-based techniques for sentence selection 
were disappointing compared to other methods, such as 
those leveraging sentence location and/or cue words and 
phrases (such as "The purpose of this paper", "In summary",
and so forth) [9], because frequency alone is a poor
indicator of salience of terms, even when the stop words are 
ignored. More indicative is the inverse document frequency 
technique, adapted from information retrieval (proposed by 
[5] in the context of summarization in particular, following
[36]), in which the relative frequency of an item in the doc- 
ument is compared with its relative frequency in a back- 
ground collection. 

The sentence selection process is based on a notion of 
salience; the most salient sentences identified are extracted 
for the summary. The salience score of a sentence is de- 
rived partly from the salience of vocabulary items (includ- 
ing single-token words, multi-word names, abbreviations, 
and multi-word terms, but excluding stop words) in the document
and partly from its position in the document structure
(e.g. section-initial, paragraph-internal, and so forth)
and the salience of the surrounding sentences. The vocabu- 
lary items from the document are looked up in the collection 
vocabulary database by a statistical component that calcu- 
lates, for each item, its inverse document frequency. This 
calculation compares the relative frequency of each item t 
in the document with the relative frequency of the item in 
the collection. This inverse document frequency measure is 
the item's salience score. 

$$\mathrm{Salience}(t) = \log_2 \frac{N_C / \mathrm{freq}(t)_C}{N_D / \mathrm{freq}(t)_D}$$

Salient items (signature terms, after [5]) are the items 
occurring more than once in the document, whose salience 
score is above an experimentally determined cutoff, or ap- 
pearing in a strategic position in the document structure 
(e.g. title, headings, etc.). All other items are assigned zero 
salience. 
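
The salience calculation and the selection of signature terms described above can be sketched as follows. This is a hedged illustration in Python, not the Textract code. It assumes that N_C and N_D denote the total number of vocabulary item occurrences in the collection and in the document respectively (the text does not define them), and it reads the definition of salient items as "(occurring more than once and above the cutoff) or in a strategic position". The function names, the fallback for items unseen in the collection, and any cutoff value are hypothetical.

```python
import math
from typing import AbstractSet, Dict


def salience(freq_doc: int, n_doc: int, freq_coll: int, n_coll: int) -> float:
    """log2 of (relative frequency in document / relative frequency in collection)."""
    return math.log2((n_coll / freq_coll) / (n_doc / freq_doc))


def signature_terms(
    doc_counts: Dict[str, int],
    coll_counts: Dict[str, int],
    n_coll: int,
    cutoff: float,
    strategic: AbstractSet[str] = frozenset(),
) -> Dict[str, float]:
    """Return the salience score per item; non-salient items are assigned zero."""
    n_doc = sum(doc_counts.values())
    scores: Dict[str, float] = {}
    for item, f_d in doc_counts.items():
        # Assumption: an item unseen in the collection falls back to its document count.
        f_c = coll_counts.get(item, f_d)
        score = salience(f_d, n_doc, f_c, n_coll)
        repeated_and_salient = f_d > 1 and score > cutoff
        scores[item] = score if (repeated_and_salient or item in strategic) else 0.0
    return scores
```

An item that is frequent in the document but rare in the collection receives a high score; an item whose relative frequency is the same in both receives zero.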

The score for a sentence is made up of two components. 
The salience component is the sum of the salience scores 
of the items in the sentence. The structure component re- 



flects the sentence's proximity to the beginning of the para- 
graph, and its paragraph's proximity to the beginning and/ 
or end of the document. Structure score is secondary to 
salience score; sentences with no salient items get no struc- 
ture score. Still, a low- or non-scoring sentence might be 
selected, anyway: thus sentences that immediately precede 
higher scoring ones in a paragraph may get promoted by 
virtue of an 'agglomeration rule', the operation of which 
is controllable from the client interface. Agglomeration 
addresses the problem of coherence discussed earlier (see 
Section 1); it is an inexpensive way of preventing dangling 
anaphors without having to identify them. 
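
The two-component sentence score and the agglomeration rule can be sketched as follows, in Python. This is an illustration under stated assumptions rather than the actual scoring function: the `Sent` record, the `structure_bonus` parameter, and the particular decay used for the structure component are hypothetical, and the structure component here models only proximity to the paragraph start (not the paragraph's position in the document).

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Sent:
    items: List[str]   # vocabulary items found in the sentence
    paragraph: int     # index of the enclosing paragraph
    position: int      # index of the sentence within its paragraph


def sentence_score(s: Sent, item_salience: Dict[str, float],
                   structure_bonus: float = 1.0) -> float:
    """Salience component plus a (hypothetical) structure component."""
    sal = sum(item_salience.get(i, 0.0) for i in s.items)
    if sal == 0.0:
        return 0.0  # sentences with no salient items get no structure score
    return sal + structure_bonus / (1 + s.position)


def select_with_agglomeration(sents: Sequence[Sent], scores: Sequence[float],
                              k: int, agglomerate: bool = True) -> List[int]:
    """Pick the k best sentences; optionally promote the immediate predecessor
    of each pick when it is in the same paragraph and scores lower."""
    ranked = sorted(range(len(sents)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:k])
    if agglomerate:
        for i in sorted(chosen):  # iterate over a snapshot of the initial picks
            prev = i - 1
            if (prev >= 0 and sents[prev].paragraph == sents[i].paragraph
                    and scores[prev] < scores[i]):
                chosen.add(prev)
    return sorted(chosen)
```

The `agglomerate` flag mirrors the fact that the rule is switchable from the client interface; promotion is applied one step back only, which is a simplification.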

Another problem for sentence-based summarizers, also 
discussed in Section 1 above, is that of thematic under- 
representation (or, loosely speaking, coverage). This is ad- 
dressed by another rule, the 'empty section' rule, which is 
of particular interest for this paper. Longer documents with 
multiple sections marked with headings, or news digests 
containing multiple stories may be unevenly represented in 
a sentence-extracted summary. The 'empty section' rule 
aims to ensure that each section is represented in the sum- 
mary by forcing inclusion of its highest scoring sentences, 
or, if all sentence scores are zero, its first sentence. 
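
The 'empty section' rule can be sketched as follows, in Python. The function name and the representation of sections as lists of sentence indices are assumptions made for illustration; for brevity the sketch forces in a single sentence per unrepresented section, whereas the rule as stated may include more than one.

```python
from typing import List, Sequence


def enforce_empty_section_rule(
    selected: List[int],
    sections: Sequence[Sequence[int]],
    scores: Sequence[float],
) -> List[int]:
    """Ensure every section contributes at least one sentence to the summary."""
    chosen = set(selected)
    for section in sections:
        if not section or chosen.intersection(section):
            continue  # empty span, or section already represented
        best = max(section, key=lambda i: scores[i])
        # If all scores in the section are zero, fall back to its first sentence.
        chosen.add(best if scores[best] > 0 else section[0])
    return sorted(chosen)
```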

In general, there are some exclusions to the sentence se- 
lection process. For example, sentences are excluded if they 
are too short (five words or fewer) or if they contain direct
quotes (more than a minimum number of words enclosed in
quotation marks).⁴
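
A sketch of these exclusion checks, in Python. The threshold for quoted material is not given in the text, so `MAX_QUOTED_WORDS` is a hypothetical value chosen only for illustration; whitespace tokenization and straight double quotes are likewise simplifying assumptions.

```python
import re

MIN_WORDS = 6         # sentences of five words or fewer are excluded
MAX_QUOTED_WORDS = 3  # hypothetical threshold; the text does not give a value


def is_excluded(sentence: str) -> bool:
    """Return True if the sentence should not be considered for extraction."""
    if len(sentence.split()) < MIN_WORDS:
        return True
    quoted_spans = re.findall(r'"([^"]*)"', sentence)
    return any(len(q.split()) > MAX_QUOTED_WORDS for q in quoted_spans)
```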

The summarization component described here performs 
best on documents within a certain genre: in effect, it as- 
sumes input of the type and length of a news story or news 
feature story (article). Furthermore, the requirement for a 
database of background statistics is clearly a crucial part 
of its design. This raises the two questions which are the 
point of departure for this paper. The first is how to han- 
dle situations where the input documents are longer, pos- 
sibly significantly so, than the average length of a news 
story. The second concerns summarization of documents 
for which no background collection exists. Clearly, neither 
of these situations is extraordinary. It is easy to conceive 
of document collections in a different genre: scientific arti- 
cles, patent descriptions, financial reports, and so forth, all 
exhibit length significantly beyond what the current summa- 
rizer is designed to represent. Furthermore, new documents 
are created all the time; by definition, these do not belong 
to any background collection. It may take time to accumu- 
late such a collection and analyze it; it may be impractical 
to store the vocabulary statistics of such a collection; it may 
be the case that existing collections do not adequately re- 
flect the domain and genre of new documents. 

We have chosen to address these two questions by mak- 
ing the summarizer aware of certain discourse-level features 
of the document by leveraging the topic shifts in it; to this 
end, the Textract infrastructure has been augmented with 
a function for linear discourse segmentation. 

⁴ Note that as a result of the document structure constructed for each
source text, such considerations are trivial to implement.



2.3. Discourse segmentation component 

Our long term goal is to bring a degree of discourse 
awareness into the summarization process. Our approach 
is to make extensive use of lexical cohesion. 

Discourse segmentation is driven by the determination of 
points in the narrative where perceptible discontinuities in 
the text cohesion are detected. Such discontinuities are in- 
dicative of topic shifts. Following the original idea of [24], 
subsequently developed specifically for the purposes of seg- 
mentation of expository text [13], we have adapted an algo- 
rithm for discourse segmentation to our document process- 
ing environment. In particular, while remaining sensitive to 
the distribution of "terms" across the document, and calcu- 
lating similarity between adjacent text blocks by a cosine 
measure, our procedure differs from that in [13] in several 
ways. 

- We only take into account content words (as opposed to
  all terms yielded by a tokenization step). These are
  normalized to lemma forms.
- "Termhood" is additionally refined to account for
  multi-word sequences (proper names, technical terms, and so
  forth, as discussed in Section 2.1 above), as well as some
  (limited) notion of co-reference, where different name
  variants get "aggregated" into the same canonical form ([39]).
- The cohesion calculation function is biased towards
  different types of possible break points: thus certain cue
  phrases ("However", "On the other hand") unambiguously
  signal a topic shift; document structure elements, such as
  sentence beginnings, paragraph openers, and section heads,
  are exploited for their 'pre-disposition' to act as likely
  segment boundaries; and so forth (see Section 2.1).
- The function is also adjusted to reduce the noise from block
  comparisons where the block boundary, and thus a potential
  topic shift, falls at unnatural break points (such as the
  middle of a sentence).

Modulo the above adjustments and modifications, we use
essentially the same formula as Hearst's for computing
lexical similarity between adjacent blocks of text $b_1$ and $b_2$
($t$ denotes a discourse element term identified as such by
Textract's prior processing, ranging over the text span
of the currently analyzed block; $w_{t,b_n}$ is the normalized
frequency of occurrence of the term in block $b_n$):

$$\mathrm{sim}(b_1, b_2) = \frac{\sum_t w_{t,b_1} \, w_{t,b_2}}{\sqrt{\sum_t w_{t,b_1}^2 \; \sum_t w_{t,b_2}^2}}$$







In essence, we are able to utilize, transparently, the re- 
sults of processes such as lexical and morphological lookup, 
document structure identification, and cue phrase detection, 
because these are already integral parts of our document 
processing environment (Textract). Likewise, the results
of the segmentation process are naturally incorporated in an 
annotation superstructure which records the various levels 
of document analysis: discourse segments are just another 
type of a "span" over a number of sentences, logically akin 
to a paragraph. 
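
The cosine comparison of adjacent text blocks discussed in this section can be sketched as follows, in Python. This is a simplified illustration and not the segmentation component itself: blocks are assumed to arrive as lists of already normalized terms, raw counts are used in place of normalized frequencies (the cosine value is unaffected by per-block scaling), and the fixed `threshold` for proposing a boundary is a hypothetical simplification that ignores the cue-phrase and document-structure biases described above.

```python
import math
from collections import Counter
from typing import List, Sequence


def block_similarity(b1: Sequence[str], b2: Sequence[str]) -> float:
    """Cosine similarity between the term-frequency vectors of two text blocks."""
    w1, w2 = Counter(b1), Counter(b2)
    dot = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    norm = math.sqrt(sum(v * v for v in w1.values()) * sum(v * v for v in w2.values()))
    return dot / norm if norm else 0.0


def candidate_boundaries(blocks: List[Sequence[str]], threshold: float) -> List[int]:
    """Indices of blocks that open a new segment, i.e. whose similarity
    to the preceding block falls below the threshold."""
    return [
        i + 1
        for i in range(len(blocks) - 1)
        if block_similarity(blocks[i], blocks[i + 1]) < threshold
    ]
```

Low similarity between neighbouring blocks signals a discontinuity in lexical cohesion, which is what the segmentation procedure treats as a candidate topic shift.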

Figure 1 illustrates the results of the 'raw' segmentation 



[Figure: text of a news story on the Afghan camps, with topic shifts marking five discourse segments]



Figure 1: 'Raw' discourse segmentation: topic shifts 

Informally, as a 'gloss' on the illustration above, the foci of 
the five segments could be described as: 

□ Afghan camps thwart Soviets; 

□ Afghanistan history in repelling superpowers; 

□ Afghan resistance and US/Arab intelligence; 

□ Afghan rebels, and the siege of Khost; 

□ Target: Osama bin Laden. 

Most other applications of segmentation, typically in in- 
formation retrieval, are primarily concerned with identify- 
ing segment boundaries: [14], [37], [30], [3], [35]. We are 
additionally interested in leveraging the content of the seg- 
ments, to the extent that it is indicative of the focus of at- 
tention, and (indirectly, at least) points at the topical shifts 
which we need to utilize for the summary generation. 

While it is unrealistic to expect that this kind of 'sum- 
mary' could be automatically generated, it is our intent to 
use the segmentation results (together with the name and 
term identification and salience calculation delivered by 
other parts of Textract) in order to make sure that all 
the base data for inferring the topic stamps, and topic shifts, 
is available to the user. 

This raises two related questions. The first concerns the 
relationship between segmentation and summarization: is 
segmentation a strictly "under the covers" service function
used by the summarizer, or might the results of discourse
segmentation be of any interest, and use, to the end
user? Unlike [17] (whose work also seeks to leverage linear 
segmentation for the explicit purposes of document summa- 
rization), we take the view that with an appropriate interface 
metaphor, where the user has an overview of the relation- 
ships between a summary sentence, the key salient phrases 
within it, and its enclosing discourse segment, a sequence of 
visually demarcated segments can impart a lot of information
directly leading to the formulation of glosses like the
one illustrated earlier. The second question thus concerns 
the features of such an interface. We return to this point 
later. 



3. Discourse-aware summarization 



As discussed in Section 1.1 above, common intuitions 
suggest a number of strategies for leveraging the results 
of linear discourse segmentation for enhancing summarization.
In our testbed environment, we arranged for segmentation
to 'publish' the topic shift points in the text into the
document structure, by defining a segment as an additional 
type of document span (not dissimilar to sentence, para- 
graph, section, and so forth), with its own from and to 
coordinates; the summarizer thus transparently and immediately
became aware of the segmentation results. We further
arranged for a mechanism whereby certain strategies
for incorporating segmentation results into the summariza- 
tion process were easy to cast in summarizer terms. Thus, 
for instance, a heuristic which would require that each seg- 
ment is represented in the summary is naturally expressed 
by treating segments as sections, and strictly enforcing the
'empty section' rule (see 2.2); a strategy which requires the
selection of a segment-initial sentence for the summary is 
enabled simply by boosting the salience score for that sen- 
tence above a known threshold; a decision to drop an anec- 
dotal segment from consideration in summary generation 
would be realised by setting, as a last step prior to summary 
generation, the sentence salience scores for all sentences in 
the segment to zero.
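
The mechanism described above can be illustrated with a minimal sketch. It is not the TEXTRACT implementation: the `Span` class, the function names, and the `BOOST` constant are hypothetical, and sentence indices stand in for the from and to coordinates of a span.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """A document span; a segment is just one more kind of span (hypothetical type)."""
    kind: str   # e.g. "sentence", "paragraph", "section", "segment"
    start: int  # 'from' coordinate: index of the first sentence covered
    end: int    # 'to' coordinate: index of the last sentence covered (inclusive)


# Hypothetical value, assumed to lie above any selection threshold in use.
BOOST = 1000.0


def boost_segment_initial(scores, segments, boost=BOOST):
    """Raise the salience of each segment-initial sentence so it is always selected."""
    out = list(scores)
    for seg in segments:
        out[seg.start] = max(out[seg.start], boost)
    return out


def drop_segment(scores, segment):
    """Zero the salience of every sentence in a segment (e.g. an anecdotal one)."""
    out = list(scores)
    for i in range(segment.start, segment.end + 1):
        out[i] = 0.0
    return out
```

Both functions return a new score list, so they can be applied as a last step before summary generation without touching the salience calculation itself.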

For evaluating the effect of various strategies upon sum- 
marizer output quality, we used as baseline an evaluation 
corpus of full-length articles, and their 'digests', from The
New York Times. There are advantages, and disadvantages, 
to this approach. Setting aside the issue of whether task-based
evaluation is the appropriate mode for testing strictly
the effect of one technology on another (see Section 3.1
below), such a decision ties us to a particular
set of data. On the positive side, this offers a realistic base- 
line against which to compare strategies and heuristics; on 
the negative side, if a certain type of data is missing from 
the evaluation corpus, there is little hard evidence for judg- 
ing the effects of strategies and heuristics on such data. In 
our particular case, even though an aspect of our investigation
focused specifically on adequately summarizing long
documents, the absence of such documents from the corpus 
prevents us from doing quantitative comparisons between 
summarizer output without, and with, segmentation. 

At the time of writing, we are working with a customer 
organization with a need for summarizing long documents; 
we hope to be able to report the results of task-based evalu- 
ation in situ in due course. In the remainder of this section 
we focus on presenting the results for small-to-average size 
documents (the collection comprises just over 800 texts, 
fewer than half of which are over 10K, and virtually none
are over 20K; the byte count includes HTML markup tags;
in terms of number of sentences per document, very few of
these longer documents are over 100 sentences long). First, 
we describe the evaluation environment. 



3.1. Summarization evaluation testbed 

Evaluating summarization results is not trivial. There is 
evidence that the optimal extract is not unique [32], [8]. The
purpose of the extract varies; so do human extractors. Sen- 
tence extraction systems may be evaluated by comparing 
the extract with sentences selected by human subjects [32], 
[10], a (superficial) objective measure that ignores the possibility
of multiple right answers. Another objective measure
compares summaries with pre-existing abstracts using
a suitable method for mapping a sentence in the abstract to 
its counterpart in the document [19]. Subjective measures, 
even though still less satisfying, can also be devised: for
instance, summary acceptability has been proposed as one 
such measure [5]. Other evaluation protocols share the pri- 
mary feature of being task-based, even though details may 
vary: performance may be measured by comparing browsing
and search time as summary abstracts and full-length
originals are being used [22], [38]; recall and precision in 
document retrieval [5]; or recall, precision, and time re- 
quired in document categorization (i.e. assessing whether 
a document has been correctly judged to be relevant or not, 
on the basis of its summary alone) [11], [12].

During the development of the base summarization func- 
tion in TEXTRACT, we built an environment for baseline 
evaluation 'in-house', as part of the development/training 
cycle. This same environment was used in analyzing the impact
of discourse segmentation on the summarizer's performance.
Background collection vocabulary statistics were
gathered from analyzing 2334 New York Times news stories.
Sentences in digests for 808 news stories and feature articles 
were automatically matched with their corresponding sentences
in the full-length documents using a version of LINGUINI,
a vector-based language identification program [31]
that was able to map source to digest sentences even when 
slight differences existed between the two. Digests range 
in length from 1 to 4 sentences. Since we were particularly 
interested in longer stories, as well as stories in which the 
first sentence in the document did not appear in the digest, 
their representation in the test set, 38%, is larger than their 
distribution in the newspaper. 
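
To make the idea of matching digest sentences to near-identical source sentences concrete, here is a minimal sketch. It does not reproduce the program cited above: the character-trigram vectors, the cosine measure, the `min_sim` threshold, and all function names are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def ngram_vector(text, n=3):
    """Character n-gram counts of a sentence, after case and whitespace normalization."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def match_digest(digest, source, min_sim=0.8):
    """For each digest sentence, return the index of the closest source sentence,
    or None if nothing reaches min_sim (a hypothetical threshold)."""
    source_vecs = [ngram_vector(s) for s in source]
    matches = []
    for sentence in digest:
        vec = ngram_vector(sentence)
        sims = [cosine(vec, sv) for sv in source_vecs]
        best = max(range(len(sims)), key=sims.__getitem__) if sims else None
        ok = best is not None and sims[best] >= min_sim
        matches.append(best if ok else None)
    return matches
```

A vector comparison of this kind tolerates small edits (a changed article, a dropped word) that would defeat exact string matching.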

A limitation of this inexpensive test approach is the in- 
herently short length of the digests, which prevents us from 
evaluating segmentation effects on summarization of long
documents. Nonetheless, a number of comparative analy- 
ses can be carried out against this baseline collection, which 
are indicative of the interplay of the various control options, 
environment settings, and TEXTRACT filters used. One parameter,
in particular, is quite instrumental in tuning the
summarizer's performance, to a large extent because it is 
directly related to length of the original document: size of
the summary, expressed either as number of sentences, or as 
percentage of the full length of the original. In addition to 
a clear intuition (size of the summary ought to be related
to the size of the original), varying the length of the summary
offers both the ability to measure the summarizer's
performance against baseline summaries (i.e. our collection
of digests), and the potential of dynamically adjusting the
derived summary size to optimally represent the full docu- 
ment content, depending on the size of that document. 

We conducted our experiments with different granular- 
ities of summary size. In principle, the performance of 
a system which does absolute sentence ranking, and systematically
picks the N 'best' sentences for the summary,
should not depend on the summary size. In our case, the 
additional heuristics for improving the coherence, readabil- 
ity, and representativeness of the summary (see Section 2.2) 
introduce variations in overall summary quality, depending 
on the compaction factor applied to the original document 
size. A representative spectrum for the test corpus we use 
is given by data points at: digest size (i.e. summary exactly 
the size, expressed as number of sentences, of the digest); 
4 sentences; 10% of the size of the full length document; 
and 20% of the document. Not surprisingly (for a salience- 
based system), the summarization function alone, without 
discourse segmentation, benefits from larger summary size. 
Although the recall rate is higher still for longer summaries, 
it is not a measure of the overall quality of the summary be- 
cause of the inherently short length of the digest. 
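
The four data points and the recall measure can be sketched as follows. This is an illustration under stated assumptions, not the evaluation code itself: the regime labels, the function names, and the choice to round percentage sizes up are all hypothetical.

```python
def summary_size(n_doc, regime, n_digest=None):
    """Target summary length, in sentences, for one of the four data points.

    regime is one of "digest", "4", "10%", "20%" (labels are hypothetical).
    """
    if regime == "digest":
        if n_digest is None:
            raise ValueError("the digest regime needs the digest length")
        size = n_digest
    elif regime == "4":
        size = 4
    elif regime in ("10%", "20%"):
        pct = int(regime.rstrip("%"))
        # Integer ceiling of n_doc * pct / 100; rounding up is an assumption.
        size = -(-n_doc * pct // 100)
    else:
        raise ValueError(f"unknown regime: {regime}")
    # A summary has at least one sentence and cannot exceed the document.
    return max(1, min(size, n_doc))


def recall(summary_ids, digest_ids):
    """Fraction of digest sentences (by source index) that the summary contains."""
    digest = set(digest_ids)
    if not digest:
        return 0.0
    return len(digest & set(summary_ids)) / len(digest)
```

Because a digest has at most four sentences, a longer summary has more chances to contain them; this is why recall at the larger sizes says little about overall summary quality.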

3.2. Segmentation effects on summarization 

Elaborating the intuitions outlined in Section 1.1, our ex- 
periments compare the base summarization procedure, cal- 
culating object salience with respect to a background doc- 
ument collection (Section 2.2), with enhanced procedures 
incorporating several different strategies for leveraging the 
notions of discourse segments and topic shifts. 

The experiments fall into one of two categories. In an
environment where a background collection, and statistics, 
cannot be assumed, a summarization procedure was defined 
to take selected (typically initial) sentences from each seg- 
ment; this appeals to the intuition that segment-initial sen- 
tences would be good topic indicators for their respective 
segments. The other category of experiment focused on en- 
riching the base summarization procedure with a sentence 
selection mechanism which is informed by segment bound- 
ary identification and topic shift detection. 
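
The first category, summarization from segmentation alone, admits a very small sketch. The round-robin fallback used when more sentences are requested than there are segments is an assumption made for this illustration, as is the representation of segments as (start, end) sentence-index pairs.

```python
def seg_only_summary(segments, size):
    """Pick sentences using segment boundaries only, with no salience scores.

    segments: (start, end) sentence-index pairs, inclusive, in document order.
    The first pass takes each segment-initial sentence; later passes (an
    assumption) take the next sentence of each segment until size is reached.
    """
    chosen = []
    longest = max((end - start + 1 for start, end in segments), default=0)
    depth = 0
    while len(chosen) < size and depth < longest:
        for start, end in segments:
            idx = start + depth
            if idx <= end and len(chosen) < size:
                chosen.append(idx)
        depth += 1
    return sorted(chosen)  # present the summary in document order
```

No background collection or vocabulary statistics are needed at any point, which is what makes this regime usable when such resources are unavailable.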

In combining different sentence selection mechanisms, 
several variables need adjustment to account for relative 
contributions of the different document analysis methods, 
especially where summaries can be specified to be of differ- 
ent lengths. Given the additional sentence selection factors 
interacting with absolute sentence ranking, we again set the 
granularity of summary size at three discrete steps, mirror- 
ing the evaluation of the original summarizer: summaries 
can be requested to be precisely 4 sentences long, or to reflect
a source compaction factor of 10% or 20% (Section 3.1).

In general, we experimented with two strategies for ac- 
tively incorporating topical information into the summary: 
one was to add the segment-initial sentences to the set of 
sentences already selected by the salience calculation mech- 
anism, the other was to exert finer control over the number 
of sentences selected via salience, and 'pad' the summary to
its requested size with sentences selected from segments by
invoking the 'empty segment' (aka 'empty section', see 2.2) 
rule. Special provisions were made to account for the fact 
that segmentation would naturally always select the first 
sentence in the document. 
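
The two strategies can be contrasted in a short sketch. It is a simplification under stated assumptions: the function names are hypothetical, padding uses the segment-initial sentence of each unrepresented segment, and the special provisions for the document-initial sentence are not modelled.

```python
def top_by_salience(scores, k):
    """Indices of the k highest-scoring sentences (ties broken by position)."""
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return ranked[:k]


def add_segment_initial(scores, segments, k):
    """Strategy 1: salience picks k sentences, then every segment-initial
    sentence is added on top (the summary may grow beyond k)."""
    chosen = set(top_by_salience(scores, k))
    chosen.update(start for start, _ in segments)
    return sorted(chosen)


def pad_empty_segments(scores, segments, size, n_salient):
    """Strategy 2: salience picks only n_salient sentences; the summary is then
    padded up to size from segments that are not yet represented."""
    chosen = set(top_by_salience(scores, min(n_salient, size)))
    for start, end in segments:
        if len(chosen) >= size:
            break
        if not any(start <= i <= end for i in chosen):
            chosen.add(start)  # assumption: pad with the segment-initial sentence
    return sorted(chosen)
```

The first strategy trades summary length for coverage, while the second holds the length fixed and gives up some salience-selected sentences instead.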

It turns out that the differences between a range of re- 
alisations of the above two strategies are not statistically 
significant over our test corpus; we thus use the label 
"SUM+SEG" to denote a 'composite' strategy and to rep- 
resent the whole family of variations. In contrast, "SUM" 
refers to the base summarization component, and "SEG" 
represents summarization by segmentation alone. Table 1 
below shows the recall rates for the three major summa- 
rization regimes defined by different summary granularities. 
Since segmentation effects are clearly very different across 
different sizes of source document, our experiments
additionally sampled the document collection
at different sizes of the originals: the corpus was split
into four sections, grouping together documents less than 
7.5K characters long, 7.5-10K, 10-19K, and over 19K; for
brevity, the table encapsulates a 'composite' result (denoted 
by the label "All documents"). What is of particular interest
here is that the complete set of data from these experiments 
makes it possible, for any given document, to select dynam- 
ically the summarization strategy appropriate to its size, in 
order to get an optimal summary for it, in any given infor- 
mation compaction regime. 

Table 1: Summary data for segmentation effects
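
Dynamic selection of a strategy by document size could be arranged along the following lines. This is a sketch only: the function names are hypothetical, 'K' is assumed to mean 1000 characters, the handling of the exact boundary values is a guess, and the recall table is supplied by the caller rather than taken from the experiments.

```python
def size_bucket(n_chars):
    """Map a document length in characters to one of the four corpus sections."""
    if n_chars < 7500:
        return "<7.5K"
    if n_chars < 10000:
        return "7.5-10K"
    if n_chars <= 19000:
        return "10-19K"
    return ">19K"


def pick_strategy(n_chars, regime, recall_table):
    """Return the strategy with the highest measured recall for this document.

    recall_table maps (size bucket, regime) to a dict such as
    {"SUM": ..., "SEG": ..., "SUM+SEG": ...}, filled in from evaluation runs.
    """
    row = recall_table[(size_bucket(n_chars), regime)]
    return max(row, key=row.get)
```

The lookup is only as good as the table behind it; a size bucket that is thinly represented in the evaluation corpus gives little basis for the choice.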

In order to get a better sense for the effects of different 
strategy mixes, we show results for the same summarization 
regimes, on subsets of the test corpus. "All documents with
> 1 digest sentence" represents documents whose digests
are longer than a single sentence; "All documents whose 
1st sent is not in target digest" extracts a document set for 
which a baseline strategy automatically picking a represen- 
tative sentence for inclusion in the summary would be inap- 
propriate. These subset selection criteria explain the deteri- 
oration of overall results; however, what is more interesting 
to observe in the table is the relative performance of the 
three summarization regimes. 

Overall, leveraging some of the segmentation analysis is 
positively beneficial to summarization; the effects are particularly
strong where short summaries are required. In addition,
the summarization procedure defined to work from
segmentation data alone shows recall rates comparable to, 
and in certain situations even higher than, the original TEX- 
TRACT function: this suggests that such a procedure is 
certainly usable in situations where background collection-based
salience calculation is impossible, or impractical.

4. Seeing the topics shift 

Unlike other TEXTRACT functions, which act like lin- 
guistic filters, and typically are incorporated 'under the 
hood' in larger systems (such as query expansion in information
retrieval [6], or document navigation in knowledge
management [27]), the summarization component stands 
alone. The user sees, directly, the result of summarizing a 
document; Figure 2 illustrates a typical view of a document 
and its summary. 

Without going into details (see [26]), the major charac- 
teristic of this interface is that the two windows, the sum- 
mary one at the top and the original document at the bot- 
tom, are asynchronously controlled via separate scroll bars. 
This is far from satisfactory, primarily because it makes it 
difficult to use the summary as a navigation tool into the 
complete document content. Various heuristics have been 
proposed to alleviate this problem, most of them using the 
notion of hyperlinking summary sentences with their counterparts
in the full document [20], and the interface illustrated
here (Figure 2) employs a similar contextualisation
device [27]. 

Figure 2: TEXTRACT summarizer: early view

This is suboptimal, primarily because the jumps from a sen- 
tence in the summary to its position in the document are 
abrupt, and because there are no visual indicators to suggest 
how two adjacent summary sentences relate to each other (if 
at all) in the document. There may be an arbitrary amount of
material intervening, which has been omitted from the summary:
knowing this, as well as knowing the extent of the
span of the missing material, is essential for better under- 
standing the summary [4]. In effect, the only way of mak- 
ing some sense of the summary as an abstraction of the full 
document is by being able to 'undo' the effects of elision
of material between each two adjacent summary sentences. 
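
The extent of the omitted material is easy to compute once summary sentences are identified by their position in the source. The following is a minimal sketch; the function name and the use of sentence indices as the unit of extent are assumptions.

```python
def omitted_spans(summary_ids, n_sentences):
    """Return the (first, last) inclusive ranges of source sentences that were
    left out before, between, and after the selected summary sentences."""
    gaps = []
    prev = -1
    for idx in sorted(summary_ids):
        if idx - prev > 1:
            gaps.append((prev + 1, idx - 1))
        prev = idx
    if n_sentences - prev > 1:
        gaps.append((prev + 1, n_sentences - 1))
    return gaps
```

An interface can render each returned range as a visual cue whose size reflects the amount of elided text between two adjacent summary sentences.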

Representing the fragments missing from the source is 
very hard to arrange for by means of a visual abstraction, 
because, as a direct consequence of the problem of under- 
representation (Section 1) in the canonical summarization 
framework, the client typically has no control over the ex- 
tent of the material which falls below the sentence salience 
threshold. However, since discourse segmentation is in- 
tended to address this problem, it also turns out to offer the 
means of a richer visual abstraction, which directly incorpo- 
rates the notion of topic shifts at the interface. We thus take 
the view that segmentation is not only a subsidiary function 
for enhancing the quality of summarization, but a process 
which is of independent utility for the end user, as long as 
its results are integrated within an appropriate interface. 5 

Figure 3 presents a screen snapshot of a prototype front 
end to a segmentation-enhanced summarizer, which is capable
of contextualising summary sentences, indicating the
span of omitted material between them, and suggesting 
grouping of summary fragments to show topic highlighting. 
A crucial feature of this interface is that the two different 
information panes, the summary one on the left and the full 
length document on the right, are synchronously scrollable; 
furthermore, both displays are 'anchored' to the segment 
span visual abstraction — the vertical bar in the middle — 
which is the primary organizational device mediating both 
the results of the summary sentence selection and topic shift 
detection. 

Figure 3: Textract summarizer: segmentation overlay



5 By way of informally defining the notion of "appropriate", it is worth 
noting that the representation in Figure 1 is not an appropriate end-user 
visualization of discourse segmentation. 



It is worth noting that without discourse segmentation, 
this kind of visual metaphor would be very hard to render 
on a summary stream which does not have topical informa- 
tion in it (such as illustrated in Figure 2). Due to the under- 
representation problem, the summary (left) pane might be 
too sparse; visually, this would translate into mis-cueing the
user as to whether what is seen in the summary pane is a complete
summary, or a fragment whose continuation is only
reachable after (an arbitrary amount of) scrolling. Additional
problems arise from lack of any data to facilitate the user in 
identifying topics missing from the summary in what would 
be a long passage in the right pane, without any topical (or 
other) annotation. 

The interface makes use of additional features: a hy- 
perlink device to facilitate attention switching between the 
summary and document panes, while retaining focus on a 
topical sentence; colour coding for marking and displaying 
salient vocabulary items; hot spots to highlight recurring 
occurrences of a salient item. We will not discuss these in 
detail here, as they are not directly related to the integra- 
tion of segmentation and summarization functions (but see 
[26]). 

5. Conclusion 

We have addressed a class of problems inherent to 
summarization-by-sentence-extraction technology, by de- 
veloping a discourse segmentation component capable of 
detecting shifts in topic, and integrating this within a 
linguistically-aware summarizer which utilizes notions of 
salience (with respect to a background document collec- 
tion) and dynamically-adjustable size of the resulting sum- 
maries. By analyzing coherence indicators in the discourse, 
segmentation identifies points in the narrative where sub- 
stories alternate; these are used to define for the summariza- 
tion function a set of discourse segments, the representation 
of which makes for summaries that are more complete, more
informative, and more faithful to the original.

Under certain conditions, segmentation-enhanced summarization
is better than the base summarization technology
utilized in Textract. Some of these conditions can be ex- 
pressed as a function of the original document length, and 
the document-to-summary ratio: this makes it possible to 
select the optimal strategy for combining the two technolo- 
gies 'on the fly'. 

In addition, having access to a segmentation component 
makes it possible to alleviate a serious shortcoming of the 
Textract summarizer: in situations where background 
collection-based salience calculation is impossible, or im- 
practical, it is still possible to deliver summaries gener- 
ated by access to discourse segmentation information alone. 
These have been shown to be of comparable quality, yet 
considerably cheaper to generate. 

This work is part of a larger effort focused on leveraging 
elements of the discourse structure in an attempt to recog- 
nize and use cohesive devices in text for a variety of content 
characterisation tasks. Additional interesting extensions
within the same space of functional enhancements would
lead to augmenting the base-level segmentation component 
with a simple measure of 'connectedness' between any two 
discourse segments; thus, by picking different chains of co- 
hesively connected segments, different perspectives on the 
document content could be revealed; by dynamically adjust- 
ing the threshold of acceptably connected segments, sum- 
maries of different length can be generated. The hope is 
that, in either case, the resulting summaries would display 
a higher degree of cohesion than that of a sequence of sentences,
due to the thematically (more) complete nature of
the discourse segments, which are the basic unit for content 
mediation in the new summaries. We are currently working 
on the infrastructure for deeper cohesion analysis. 
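
One possible reading of the 'connectedness' idea is sketched below. The measure itself is left open in the text, so everything here is an assumption: Jaccard overlap between the term sets of two segments stands in for connectedness, and a chain is grown from a seed segment by following links at or above a threshold.

```python
def jaccard(a, b):
    """Overlap between two term sets, used here as a stand-in connectedness measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def connected_chain(segment_terms, seed, threshold):
    """Indices of all segments reachable from seed through links whose
    connectedness is at least threshold, in document order."""
    seen = {seed}
    frontier = [seed]
    while frontier:
        current = frontier.pop()
        for j, terms in enumerate(segment_terms):
            if j not in seen and jaccard(segment_terms[current], terms) >= threshold:
                seen.add(j)
                frontier.append(j)
    return sorted(seen)
```

Lowering the threshold admits more segments into the chain and hence yields a longer summary, while different seeds give different perspectives on the same document.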

We are also experimenting with more dynamic inter- 
faces, capable of fully utilizing the results of multiple anal- 
yses, both in the context of single document summaries, and 
content-mediated navigation in a document collection. Ex- 
tensions and modifications to current interface metaphors 
incorporate notions like larger (and guaranteed to be thematically
coherent) text fragments, representative sentences
which may be more or less central/peripheral to a given 
summary thread, multiple threads (summaries) through the 
same document source, and multi-level document abstrac- 
tions mediated via different levels of granularity of content. 